PIRSF: family classification system at the Protein Information Resource

Authors

  • Cathy H. Wu
  • Anastasia Nikolskaya
  • Hongzhan Huang
  • Lai-Su L. Yeh
  • Darren A. Natale
  • C. R. Vinayaka
  • Zhang-Zhi Hu
  • Raja Mazumder
  • Sandeep Kumar
  • Panagiotis Kourtesis
  • Robert S. Ledley
  • Baris E. Suzek
  • Leslie Arminski
  • Yongxing Chen
  • Jian Zhang
  • Jorge Louie Cardenas
  • Sehee Chung
  • Jorge Castro-Alvear
  • Georgi Dinkov
  • Winona C. Barker
Abstract

The Protein Information Resource (PIR) is an integrated public resource of protein informatics. To facilitate the sensible propagation and standardization of protein annotation and the systematic detection of annotation errors, PIR has extended its superfamily concept and developed the SuperFamily (PIRSF) classification system. Based on the evolutionary relationships of whole proteins, this classification system allows annotation of both specific biological and generic biochemical functions. The system adopts a network structure for protein classification from superfamily to subfamily levels. Protein family members are homologous (sharing common ancestry) and homeomorphic (sharing full-length sequence similarity with common domain architecture). The PIRSF database consists of two data sets, preliminary clusters and curated families. The curated families include family name, protein membership, parent-child relationship, domain architecture, and optional description and bibliography. PIRSF is accessible from the website at http://pir.georgetown.edu/pirsf/ for report retrieval and sequence classification. The report presents family annotation, membership statistics, cross-references to other databases, graphical display of domain architecture, and links to multiple sequence alignments and phylogenetic trees for curated families. PIRSF can be utilized to analyze phylogenetic profiles, to reveal functional convergence and divergence, and to identify interesting relationships between homeomorphic families, domains and structural classes.

Correspondence: Tel: +1 202 687 2121; Fax: +1 202 687 1662; Email: pirmail@georgetown.edu

Nucleic Acids Research, 2004, Vol. 32, Database issue, D112-D114. DOI: 10.1093/nar/gkh097. © Oxford University Press 2004; all rights reserved.

INTRODUCTION

The Protein Information Resource (PIR) is an integrated public bioinformatics resource that supports genomic and proteomic research and scientific studies. For over three decades, PIR has provided many protein databases and analysis tools freely accessible to the scientific community, including the PIR-International Protein Sequence Database (PSD) of functionally annotated protein sequences, which grew out of the Atlas of Protein Sequence and Structure (1) edited by Margaret Dayhoff. PIR has recently joined forces with the European Bioinformatics Institute (EBI) and the Swiss Institute of Bioinformatics (SIB) to establish UniProt (the Universal Protein Knowledgebase) (2), the central resource of protein sequence and function, by unifying the database activities of PIR-PSD, Swiss-Prot and TrEMBL. In addition, we have implemented the new PIRSF (SuperFamily) classification system, which is described below. We have also enhanced iProClass (3), an integrated database of protein family, function and structure information with executive summaries and cross-references to over 50 molecular databases; maintained PIR-NREF (4), a non-redundant reference database; and improved the PIR website for scientific inquiry and system dissemination.

PIRSF SYSTEM DEFINITION

The PIR superfamily/family concept (5), the original classification based on sequence similarity, has been used as a guiding principle to provide comprehensive and non-overlapping clustering of PIR protein sequences into a hierarchical order to reflect their evolutionary relationships (6). To facilitate the sensible propagation and standardization of protein annotation and the systematic detection of annotation errors as part of the UniProt project, PIR has extended its hierarchical superfamily concept and developed the PIRSF system, a 'network classification system based on the evolutionary relationships of whole proteins'. Classification based on whole proteins, rather than on the component domains, allows annotation of both generic biochemical and specific biological functions. Furthermore, it permits the classification of proteins without well-defined domains. The network classification system accommodates a flexible number of levels that reflect varying degrees of sequence conservation. Such structure allows improved protein annotation, more accurate extraction of conserved functional residues and classification of distantly related orphan proteins.

The primary level for curation is the homeomorphic family, which consists of proteins that are both homologous (evolved from a common ancestor as inferred by detectable sequence similarity) and homeomorphic (sharing full-length sequence similarity and a common domain architecture). Common domain architecture is indicated by the same type, number and order of core domains. Variation may exist for repeating domains and/or auxiliary domains, which are often mobile and may be easily lost, acquired or functionally replaced during evolution.

Above the 'homeomorphic family' nodes in the network structure are parent superfamily nodes that connect distantly related families and orphan proteins based on common domains. They may be homeomorphic superfamilies, but are more likely to be domain superfamilies if the common domain regions do not extend over the entire full-length proteins. Below the homeomorphic family nodes are child subfamily nodes, which are homologous and homeomorphic clusters representing functional specialization and/or domain architecture variation within a family. The PIRSF system definition and working principles are detailed in the document, A Proposal for the PIRSF Classification System, available from the PIR website.

PIRSF DATABASE CREATION AND CURATION

The PIRSF database consists of two data sets, preliminary clusters and curated families. Currently, about two-thirds of UniProt sequences are classified into over 32 000 preliminary clusters, including single-member clusters. The preliminary clusters are computationally defined using both pairwise-based parameters (% sequence identity, sequence length ratio and overlap length ratio) and cluster-based parameters (% matched members, distance to neighboring clusters and overall domain arrangement).

Systematic family curation is being conducted in a two-tier process to improve the quality of automated classification. Over 4500 families containing two or more members have been curated at the 'first-tier' for membership and domain architecture characteristic of the family. PIRSF has two membership types: regular members for proteins sharing end-to-end sequence similarity and associate members for proteins whose lengths deviate from the family length range, including incomplete sequences, alternate splice and initiator variants, and peptides derived from proteolytic processing. A subset of representative regular members is chosen as seed members for generating multiple sequence alignments, phylogenetic trees and Hidden Markov Models (HMMs) of the respective families. The second-tier curation provides additional annotation, including family name, parent-child relationship, family description and bibliography. Several hundred second-tier curated PIRSF families have been integrated into InterPro (7). The incorporation of PIRSF families into InterPro and the implementation of a system to check the validity and integrity of existing families create additional means of ensuring accuracy and consistency in UniProt classification and annotation.
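The homeomorphic criterion and the two membership types described above can be illustrated with a minimal Python sketch. It is not the PIRSF implementation. The `Protein` record, the function names and the domain labels are all hypothetical, and two simplifications are assumed: core-domain architecture is compared as an exact ordered tuple (repeating and auxiliary domains are not modelled), and membership type is decided from sequence length alone, whereas the text also requires end-to-end sequence similarity for regular members.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Protein:
    """Hypothetical record; the field names are illustrative, not PIRSF's."""
    accession: str
    length: int                     # sequence length in residues
    core_domains: Tuple[str, ...]   # core domains in N- to C-terminal order


def same_core_architecture(a: Protein, b: Protein) -> bool:
    """True when both proteins have the same type, number and order of core domains."""
    return a.core_domains == b.core_domains


def membership_type(protein: Protein, family_length_range: Tuple[int, int]) -> str:
    """Return 'regular' if the length falls inside the family range, else 'associate'.

    Assumption: length is used as the only criterion in this sketch.
    """
    shortest, longest = family_length_range
    return "regular" if shortest <= protein.length <= longest else "associate"


if __name__ == "__main__":
    # Made-up proteins and an assumed family length range of 250-350 residues.
    full = Protein("EX_FULL", 300, ("domA", "domB"))
    fragment = Protein("EX_FRAG", 120, ("domA", "domB"))
    for p in (full, fragment):
        print(p.accession, same_core_architecture(full, p), membership_type(p, (250, 350)))
```

Keeping the two checks as separate functions mirrors the text, where domain architecture and membership are curated as distinct family properties.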

Related articles

PIRSF Family Classification System for Protein Functional and Evolutionary Analysis

The PIRSF protein classification system (http://pir.georgetown.edu/pirsf/) reflects evolutionary relationships of full-length proteins and domains. The primary PIRSF classification unit is the homeomorphic family, whose members are both homologous (evolved from a common ancestor) and homeomorphic (sharing full-length sequence similarity and a common domain architecture). PIRSF families are cura...

The PIR integrated protein databases and data retrieval system

The Protein Information Resource (PIR) provides many databases and tools to support genomic and proteomic research. PIR is a member of UniProt (Universal Protein Resource), the central repository of protein sequence and function, which maintains UniProt Knowledgebase with extensively curated annotation, UniProt Reference databases to speed sequence searches, and UniProt Archive to reflect sequen...

Classifying Predicates and Languages

The present paper studies a particular collection of classification problems, i.e., the classification of recursive predicates and languages, for arriving at a deeper understanding of what classification really is. In particular, the classification of predicates and languages is compared with the classification of arbitrary recursive functions and with their learnability. The investigation undertake...

The iProClass integrated database for protein functional analysis

Increasingly, scientists have begun to tackle gene functions and other complex regulatory processes by studying organisms at the global scales for various levels of biological organization, ranging from genomes to metabolomes and physiomes. Meanwhile, new bioinformatics methods have been developed for inferring protein function using associative analysis of functional properties to complement t...

Binary Feature Selection with Conditional Mutual Information

In a context of classification, we propose to use conditional mutual information to select a family of binary features which are individually discriminating and weakly dependent. We show that on a task of image classification, despite its simplicity, a naive Bayesian classifier based on features selected with this Conditional Mutual Information Maximization (CMIM) criterion performs as well as a c...

Journal title: Nucleic Acids Research

Volume 32, Database issue

Pages D112-D114

Publication date 2004